To predict ridership, we can use machine learning. Machine learning methods fall into supervised and unsupervised learning; one way to perform basic supervised learning is linear regression with gradient descent. The goal is to find the set of coefficients (thetas) that, when applied to the features, produces predicted values as close as possible to the observed values.
To implement this algorithm, I need a cost function, J(θ), to provide a measure of how well the current set of thetas performs. J(θ) returns a fraction of the sum of the squared differences between predicted and observed values, and we want to minimize it. h(Xᵢ) is the hypothesis function that returns the predicted value for sample i under the current set of thetas.
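Written out explicitly (the same formulation that compute_cost below implements), with m samples, feature vectors $x^{(i)}$ and observed values $y^{(i)}$:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2, \qquad h_\theta(x^{(i)}) = \theta^T x^{(i)}$$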
Gradient descent is an algorithm that converges to the minimum of J(θ). It updates the thetas according to the equation below. The α parameter is the learning rate, which controls the size of the step to take in the direction that makes J(θ) smaller. A larger α converges more quickly, but it is more prone to overshoot the minimum of J(θ) and cause J(θ) to increase. One way to make sure that α is suitable is to keep track of the history of J(θ). This is implemented in the functions below.
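The update applied on every iteration (as implemented in gradient_descent below) is, for each coefficient $\theta_j$:

$$\theta_j := \theta_j + \frac{\alpha}{m}\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)}$$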
Once we find the set of thetas, we need to evaluate the effectiveness of the model. One way is to measure the coefficient of determination, also called R squared, which is calculated by the following equation.
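With observed values $y_i$, predictions $\hat{y}_i$ and observed mean $\bar{y}$ (this is what compute_r_squared below computes):

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$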
In [18]:
import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('turnstile_weather_v2.csv', parse_dates=['datetime'])
df_entries_hourly = df['ENTRIESn_hourly']
In [19]:
'''
Normalize features to speed up computing process
'''
def normalize_features(df_features):
    mu = df_features.mean()
    sigma = df_features.std()
    if (sigma == 0).any():
        raise Exception("One or more features had the same value for all samples, and thus could " + \
                        "not be normalized. Please do not include features with only a single value " + \
                        "in your model.")
    df_normalized = (df_features - mu) / sigma
    return df_normalized, mu, sigma
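As a quick illustration of what the normalization does (toy numbers, not from the turnstile data), each column of a normalized frame ends up with mean 0 and standard deviation 1:

# Toy illustration only -- made-up values, not from the dataset
toy = pd.DataFrame({'hour': [0, 4, 8, 12, 16, 20], 'tempi': [55., 58., 63., 70., 66., 60.]})
toy_norm, toy_mu, toy_sigma = normalize_features(toy)
print toy_norm.mean()  # approximately 0 for every column
print toy_norm.std()   # 1 for every column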
In [20]:
'''
Compute the cost J(Theta) given a set of features, values and thetas (coefficients).
'''
def compute_cost(features, values, theta):
    predicted_values = np.dot(features, theta)
    sum_squared_errors = np.square(predicted_values - values).sum()
    m = len(values)
    cost = sum_squared_errors / (2 * m)
    return cost
In [21]:
'''
Perform gradient descent given features, values, initial thetas (coefficients), alpha (learning rate) and number of iterations.
Return the final thetas that minimize the cost, along with the cost history of each iteration.
'''
def gradient_descent(features, values, theta, alpha, num_iterations):
    m = len(values)
    cost_history = []
    for i in range(num_iterations):
        predicted_values = np.dot(features, theta)
        theta = theta + (alpha / m) * np.dot((values - predicted_values), features)
        cost = compute_cost(features, values, theta)
        cost_history.append(cost)
    return theta, pd.Series(cost_history)
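As a sanity check on the implementation (a made-up example, assuming the cells above have already been run), gradient descent should recover the coefficients of a synthetic linear relationship, and the cost history should shrink toward zero:

# Synthetic data: y = 3*x1 + 5*x2, so the recovered thetas should approach [3, 5]
toy_features = np.array([[1., 0.], [0., 1.], [1., 1.], [2., 1.], [1., 2.]])
toy_values = np.dot(toy_features, np.array([3., 5.]))
toy_theta, toy_costs = gradient_descent(toy_features, toy_values, np.zeros(2), 0.1, 500)
print toy_theta                              # close to [3. 5.]
print toy_costs.iloc[0], toy_costs.iloc[-1]  # the cost drops toward 0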
In [22]:
'''
Main function to initialize values, run gradient_descent and output predictions
'''
def calc_predictions(features, values, theta, alpha=0.1, num_iterations=75,
                     plot_cost=False, print_theta=0):
    m = len(values)
    features, mu, sigma = normalize_features(features)
    features['ones'] = np.ones(m) # Add a column of 1s (y intercept)
    # Convert features and values to numpy arrays
    features_array = np.array(features)
    values_array = np.array(values)
    # If initial theta is not specified, use zeros
    if theta is None:
        theta = np.zeros(len(features.columns))
    # Initialize theta, perform gradient descent
    theta, cost_history = gradient_descent(features_array,
                                           values_array,
                                           theta,
                                           alpha,
                                           num_iterations)
    if plot_cost == True:
        plot_cost_history(alpha, cost_history)
    if print_theta > 0:
        print '{0} thetas, first {1}: {2}'.format(len(theta), print_theta, list(theta[:print_theta]))
        print 'Y-intercept = {0}'.format(list(theta[len(theta)-1:]))
    predictions = np.dot(features_array, theta)
    return predictions
In [23]:
'''
Compute R squared, coefficient of determination
'''
def compute_r_squared(data, predictions):
    sum_square_errors = np.square(data - predictions).sum()
    y_bar = np.mean(data)
    sum_square_diffs = np.square(data - y_bar).sum()
    r_squared = 1 - (sum_square_errors / sum_square_diffs)
    return r_squared
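For example (made-up numbers, just to illustrate the behaviour of the measure), predictions that explain most of the variation give an R^2 close to 1, while simply predicting the mean everywhere gives exactly 0:

obs = np.array([10., 20., 30., 40.])
print compute_r_squared(obs, np.array([12., 18., 31., 39.]))  # 0.98
print compute_r_squared(obs, np.full(4, obs.mean()))          # 0.0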
In [24]:
'''
Return the column that contains 1 if the date is a holiday, otherwise 0
'''
def get_holiday():
    #return np.where((df['DATEn'] == '05-30-11') | (df['DATEn'] == '05-31-11'), 1, 0)
    return np.where(df['DATEn'] == '05-30-11', 1, 0)
In [25]:
'''
Plot the history of costs of all the iterations
'''
def plot_cost_history(alpha, cost_history):
    num_iterations = len(cost_history)
    f, ax = plt.subplots()
    ax.plot(range(num_iterations), cost_history)
    ax.set_xlabel('%i iterations' % num_iterations)
    ax.set_ylabel('Cost')
    ax.set_title('Cost history for alpha = %.2f' % alpha)
In [26]:
'''
Get (and plot a histogram of) the residuals, the difference between the original hourly entry data and the predicted values
'''
def get_residuals(values, predictions, plot=False, bin_size=100):
    residuals = (values - predictions)
    if plot == True:
        k2, p = st.normaltest(residuals)
        # Plot a single histogram of the residuals
        f, ax = plt.subplots()
        ax.hist(residuals, bins=bin_size)
        ax.set_xlabel('Value minus prediction, bins = %i' % bin_size)
        ax.set_ylabel('Frequency')
        ax.set_title('Histogram of the residuals, mean = %.5f\n'
                     'Normality test, k2 = %.5f, p = %.5f' % (residuals.mean(), k2, p))
    return residuals
In [27]:
'''
Another way to see the normality of residuals. Generate a probability plot of sample data against the quantiles of a specified
theoretical distribution (the normal distribution by default)
'''
def probplot_residuals(residuals):
    f, ax = plt.subplots()
    st.probplot(residuals, plot=ax)
In [28]:
def plot_observed_predicted(observed, predicted, sample_size):
    x_arange = np.arange(0, sample_size, 1)
    f, ax = plt.subplots()
    ax.plot(x_arange, observed[:sample_size], 'b', x_arange, predicted[:sample_size], 'g')
    ax.legend(['Observed', 'Predicted'])
    ax.set_xlabel('First %i values' % sample_size)
    ax.set_ylabel('Hourly entries')
    ax.set_title('Compare observed and predicted value samples')
In [29]:
def print_features_r2(str_features, df_entries_hourly, predictions):
    print 'Features: [%s] R^2 = %.5f' % (str_features, compute_r_squared(df_entries_hourly, predictions))
I tried different sets of features to see which would give a better R^2. The lines below were commented out to save time; feel free to uncomment them and try yourself.
The results are:
Features: [hour, day_week] R^2 = 0.09159
Features: [hour, weekday] R^2 = 0.10385
Features: [hour, weekday, holiday] R^2 = 0.10913
Features: [hour, weekday, rain] R^2 = 0.10389
Features: [hour, weekday, fog] R^2 = 0.10418
Features: [hour, weekday, current] R^2 = 0.10626
Features: [hour, weekday, mean] R^2 = 0.11274
Features: [hour, weekday, conds] R^2 = 0.11058
Features: [hour, weekday, station] R^2 = 0.43015
Features: [hour, weekday, UNIT] R^2 = 0.48137
Features: [hour, weekday, current, holiday, UNIT] R^2 = 0.48855
Features: [hour, weekday, mean, holiday, UNIT] R^2 = 0.48674
Though R^2 = 0.10389 for the features [hour, weekday, rain] is not the best result, I'm taking it as an example to explain the resulting thetas because of its simplicity (no dummy features). The model produced by this feature set is:
Ypredicted = (848.7024087859794 * hour) + (430.90383721963343 * weekday) + (20.621268321381383 * rain) + 1885.89193865641
Below I calculate Ypredicted manually with these thetas and the normalized features to show how the predicted value is computed.
In [30]:
features = df[['hour', 'weekday', 'rain']]
predictions = calc_predictions(features, df_entries_hourly, None, print_theta=3)
print_features_r2('hour, weekday, rain', df_entries_hourly, predictions)
print 'predictions[10] = {0:,.5f}'.format(predictions[10])
df_normalized, mu, sigma = normalize_features(features)
df_norm_10 = df_normalized.iloc[[10],[0, 1, 2]]
print 'Normalized features of row 10: '
print df_norm_10
hour = df_norm_10.loc[10].loc['hour']
weekday = df_norm_10.loc[10].loc['weekday']
rain = df_norm_10.loc[10].loc['rain']
print '(848.7024087859794 * {0}) + (430.90383721963343 * {1}) + (20.621268321381383 * {2}) + 1885.891938656415 = {3:,.5f}'. \
format(hour, weekday, rain, (848.7024087859794 * hour) + (430.90383721963343 * weekday) + (20.621268321381383 * rain) + 1885.891938656415)
From all the feature sets I tested, [hour, weekday, holiday, conds, UNIT] gave me a pretty good R^2 without including a huge number of features. For this case, I first plot the cost history below. We can see it converged to a consistently low value, which means the alpha value of 0.1 is good enough. The second plot shows how the predicted values diverge from the observed values. We can see that some predicted values are negative, which is not possible in reality but can happen for predictions. Even with the negative predictions, the third plot of the residuals shows a mean very close to zero. In addition, if the distribution of the residuals is normal, we can conclude that the model is a good one. However, the normality test gave a p-value of 0.00, so even though the plot looks normal, it is not. I wondered whether this could be caused by the really large residuals, so I made one more quantile probability plot to take a look (http://discussions.udacity.com/t/residuals-normality/7035/2), and that seems to be the case.
In [31]:
features = df[['hour', 'weekday']]
features['holiday'] = get_holiday()
# Add conds to features using dummy variables
dummies = pd.get_dummies(df['conds'], prefix='conds')
features = features.join(dummies)
# Add UNIT to features using dummy variables
dummies = pd.get_dummies(df['UNIT'], prefix='unit')
features = features.join(dummies)
predictions = calc_predictions(features, df_entries_hourly, None, plot_cost=True)
plot_observed_predicted(df_entries_hourly, predictions, 100)
residuals = get_residuals(df_entries_hourly, predictions, plot=True)
probplot_residuals(residuals)
print_features_r2('hour, weekday, holiday, conds, UNIT', df_entries_hourly, predictions)
From the previous experiments, we can say hour, weekday/day_week, station/UNIT and holiday are the more important features. Although in section 1 we already saw that only about 0.06% of the variation was due to rain, for this project let's take a closer look at the theta of the "rain" feature. From the results below we can tell that rain, if it does not decrease ridership imperceptibly, increases it only a little. Combining this with the statistical tests in section 1, we can say that rainy days did have significantly higher ridership, but rain affects ridership only slightly, as shown by the thetas.
In [32]:
features = df[['hour', 'weekday', 'rain']]
# Add UNIT to features using dummy variables
dummies = pd.get_dummies(df['UNIT'], prefix='unit')
features = features.join(dummies)
predictions = calc_predictions(features, df_entries_hourly, None, print_theta=3)
print_features_r2('hour, weekday, rain, UNIT', df_entries_hourly, predictions)
In [33]:
features = df[['hour', 'weekday', 'rain']]
features['holiday'] = get_holiday()
# Add UNIT to features using dummy variables
dummies = pd.get_dummies(df['UNIT'], prefix='unit')
features = features.join(dummies)
predictions = calc_predictions(features, df_entries_hourly, None, print_theta=4)
print_features_r2('hour, weekday, rain, holiday, UNIT', df_entries_hourly, predictions)
Finally, I tried including as many features as possible and did get a better R^2. The features related to rainfall have these coefficients:
rain: 41.944959066180211
meanprecipi: 85.454256873370198
precipi: -99.97437901372767
In [34]:
features = df[['weekday', 'day_week', 'rain', 'fog', 'meanprecipi', 'meanpressurei', 'meantempi', 'meanwspdi', 'precipi', 'pressurei', 'tempi', 'wspdi']]
features['holiday'] = get_holiday()
# Add conds to features using dummy variables
dummies = pd.get_dummies(df['conds'], prefix='conds')
features = features.join(dummies)
# Add UNIT to features using dummy variables
dummies = pd.get_dummies(df['UNIT'], prefix='unit')
features = features.join(dummies)
# Add datetime to features using dummy variables
dummies = pd.get_dummies(df['datetime'], prefix='datetime')
features = features.join(dummies)
predictions = calc_predictions(features, df_entries_hourly, None, print_theta=13, plot_cost=True)
plot_observed_predicted(df_entries_hourly, predictions, 100)
residuals = get_residuals(df_entries_hourly, predictions, plot=True)
probplot_residuals(residuals)
print_features_r2('many', df_entries_hourly, predictions)